The heavy involvement of Arabic internet users has spread data written in the Arabic language and
created a vast research area in natural language processing (NLP). Sentiment analysis is a growing
field of research of great importance, given its high potential for decision-making and for
predicting upcoming actions from the texts produced in social networks. Arabic used in
microblogging websites, especially Twitter, is highly informal. It complies with neither standards
nor spelling regulations, making it quite challenging for automatic machine-learning techniques. In
this paper, we propose a new approach based on AutoML methods to improve the efficiency of the
sentiment classification process for dialectal Arabic. This approach was validated through
benchmark testing on three different datasets that represent three vernacular forms of Arabic. The
obtained results show that the presented framework achieves significantly higher accuracy than
similar works in the literature.

Keywords: AutoML, Informal Arabic, Polarity detection, Sentiment analysis, Tree-based optimization tool

1. INTRODUCTION

Sentiment analysis is a very active field of research positioned at the crossroads of natural language 
processing (NLP), text mining, and machine learning techniques. It involves analyzing a piece of text to 
retrieve the attitudes [1] and behavioral insights about an entity (product, feature, and service) or a feature of 
an entity. Subsequently, sentiment analysis tackles the study of the text in order to attribute a category of 
sentiment orientation [2], most commonly by detecting positive (favorable) or negative (unfavorable) 
polarity. 

Since the substantial explosion of the world wide web, Internet users are not just information 
recipients anymore, but they contribute to effectively building up a myriad of publicly available content. This 
opinionated data is shared over social media within different user communities, where it spreads viewpoints 
about various topics such as politics, education, health systems, and product quality. Such subjective opinions 
may even alter the perception of reality and lead to contention regarding controversial subjects. An example
of this was the Arab Spring, when the Arab world was vigorously shaken in 2011 by a movement emanating
from claims on social networks by societies contesting the authoritarian modes of governance they
had endured over the past decades.

Therefore, states and organizations no longer take for granted the data disseminated on the web, and
special attention is paid to social media to track and even monitor the commonly shared information. In
particular, the content created by Arab internet users is growing steadily due to the rising use of the
internet and online services spurred by the coronavirus disease (COVID-19) context. As a result, scientific
papers dealing with sentiment analysis for the Arabic language have grown strongly in recent
years [3]-[5]. Researchers are addressing sentiment analysis to overcome the scarcity of tools
and corpora available for the three forms of the Arabic language, namely Standard Arabic, Classical Arabic, and
Dialectal Arabic [6]. This paper proposes a
model based on the tree-based optimization tool (TPOT) [7] to tune the hyperparameters of machine learning
algorithms. We use this model for the polarity classification of sentiments derived from tweets in the
dialectal Arabic language.

The remainder of this paper is organized as follows: section 2 gives an overview of the existing works related
to sentiment analysis methods, then highlights the most challenging issues of processing the Arabic language.
In section 3, the related works are presented with emphasis on AutoML frameworks, especially TPOT.
Next, section 4 details the proposed method, section 5 describes the evaluation setup, and section 6
presents the results. Lastly, we finish with a conclusion and future works.


2. BACKGROUND RESEARCH 
2.1. Sentiment analysis overview 

The applications of sentiment analysis span a broad landscape ranging from marketing to politics.
With the COVID-19 pandemic, almost all activities have become digital, forcing people
to search for online services and thus to consult the opinions of other users who may have already tested
them. People and companies are interested in knowing how many positive reviews there are about a product,
a company, or a feature. Sentiment analysis encompasses the study of a piece of text at different
granularities: the document level, the sentence level, and the aspect level. The document level
refers to the overall sentiment expressed through the entire text, whether the document is a review,
a comment, or a tweet. A document can be written in several sentences, and we assume that a document
expresses a single sentiment regarding a specific subject. The sentence level, by contrast, studies
a sentence to infer its subjectivity and subsequently identify the sentiment and opinion it conveys.
Finally, the aspect level, sometimes referred to as the feature level, is the most fine-grained layer
since it highlights the sentiment towards a given target.

At all three levels, sentiment analysis can be conducted through three families of methods: automated
machine-learning techniques, lexicon-based approaches, and hybrid ones [8]. Machine learning methods are further
categorized into supervised, unsupervised, and semi-supervised. Supervised learning considers the use of 
training documents to train algorithms that will classify test documents. The most known supervised 
algorithms for the sentiment analysis task are support vector machines (SVM), naive Bayes (NB), decision 
tree (DT), and maximum entropy (ME) [9]-[12]. In contrast, unsupervised learning allows detecting common 
elements to group similar documents into clusters without having a training set. The k-means clustering 
algorithm is prominently used for that. Semi-supervised learning incorporates both labeled and unlabeled 
data to perform sentiment classification. The most commonly used algorithms for semi-supervised learning 
are the ensemble approaches such as boosting, bagging, and random forest (RF). The lexicon-based approach 
is based on a lexicon composed of a collection of terms. Each term conveys a known sentiment. This 
approach includes dictionary-based and corpus-based techniques. Finally, the hybrid approaches rely on 
using machine learning methods combined with sentiment lexicons to enhance classification accuracy. 

Sentiment classification addresses the task of classifying documents under two or multiple 
categories. When two categories are involved, then the task consists of detecting polarity (positive/negative). 
Polarity detection is referred to as the binary classification task [13]. The ternary classification is also widely 
applied by adding the neutral/objective class. When there are more than three classes, it is called multi-way 
classification [14], where we can classify the materials according to emotional intensity. Moreover, it is 
possible to use other classes, including, for example, the sarcasm class or the mixed class. 

Figure 1 summarizes the different techniques used for sentiment analysis. More recently, efforts
have been made to create resources and tools in the discipline of affective computing and sentiment analysis
(ACSA) [15], which focuses on emotion recognition, subjectivity detection, and opinion target identification.
SenticNet is among the most used resources in ACSA. It aims to develop intelligent algorithms
based on concept-level knowledge in order to tackle the cognitive and affective aspects of natural
language that machine learning algorithms alone do not cover. The latest released version of SenticNet [16]
covers more than 100,000 commonsense concepts in the English language. It represents data from a semantic
perspective instead of using a syntactic methodology.

Although the SenticNet initiative has significantly advanced the handling of ACSA-related tasks,
one major constraint remains: it focuses entirely on English [17]. With the expansion of social
networking worldwide, many other languages are becoming more prominent on the internet.
However, these languages have not gained enough interest, and the works dedicated to them are still barely
noticeable in comparison with English. Vilares et al. [18] proposed BabelSenticNet to handle 40 languages,
including Arabic. BabelSenticNet is created in two steps: a first version is built with statistical machine
translation, then a matching between SenticNet concepts and WordNet synsets is performed to ensure accuracy
at the concept level.

A similar work was done in [19], where a concept-based sentiment analysis system is proposed to
handle Arabic concepts. The construction of the Arabic SenticNet lexicon consists of two stages: two-way
translation and the extension of the senses of the Arabic version of WordNet. The system embeds
a rule-based semantic parser to comply with the grammatical and morphological requirements of the Arabic
language. Although the proposed system achieved a 93% F-score using the concept, lexical, and
word2vec features, the tests were carried out on a dataset of news articles written in modern standard
Arabic, which, unlike informal dialectal Arabic, has more structured phrases that are easier to
transform into concepts.


[Figure: sentiment analysis of a dataset through machine learning methods (e.g. semi-supervised) and
lexicon-based approaches]

Figure 1. Techniques of sentiment analysis


2.2. Informal Arabic challenges 

Arabic is a Semitic language used by more than 400 million people worldwide, of whom
around 183 million are active internet users, placing Arabic fourth among the most used languages on
the web. The Arabic alphabet consists of 28 letters whose shapes vary according to their position in a
word; it has no upper-case letters and, unlike English, is written from right to left. There are three
types of Arabic: classical Arabic, the language of the Holy Quran; modern standard Arabic, the official
language for education, the news, and all formal circumstances or events; and colloquial or informal
Arabic, the everyday way people talk to one another. All three types share some morphological
characteristics but differ in orthography, grammar, and lexicon [20]. Colloquial Arabic varies greatly
depending on the geographical area. Generally, five major dialect groups are considered: Egyptian,
Levantine, Maghrebian, Peninsular, and Mesopotamian. Within every dialect, several regional varieties
also exist. Below, we flesh out some of the main challenges in processing colloquial Arabic text to
analyze sentiments:

— Morphology: Arabic is considered to be a morphologically rich language (MRL) [21] due to its
agglutinative and highly inflectional character compared to other languages [22]. This strong
agglutination generates an abundance of new words from a single morpheme such as a
stem or root. A morpheme is the smallest meaningful unit of letters [23]. By appending clitics to a
root, multiple words can be produced, and they change form and shape according to their position within a
sentence. For example, from the root kharaja/خرج, we can add affixes to create other verbs or nouns.
Those affixes can be infixes, prefixes, or suffixes. Table 1 shows different words
derived from this root.

— Orthography and transliteration: Given the variety of Arabic dialects, every dialect has a
distinctive orthography and lexicon. A single word may be spelled in different ways across dialects
and even within the same dialect. Thus, one may find several words that have the same meaning but
are spelled differently. The same problem arises when transliterating a word from a foreign
language.

— Arabizi: Arabizi is a recent phenomenon in writing the Arabic language, related to the expansion of social
media and live chats. It consists of writing Arabic with the Latin alphabet while adding digits to stand
for Arabic letters that have no Latin equivalent.

— Unstructured words: The most prevalent form of Arabic used on the internet is dialectal, since users
increasingly provide content in the language they use to communicate. Because there is no standardized
format for the various Arabic dialects, two persons can spell a given word very differently, which
makes getting the root of the word quite arduous. For instance, in the Moroccan
dialect, one can write makanbghish or makanbish, which means "I do not appreciate".
Another common practice is to duplicate letters or use the tatweel [24] to stretch some Arabic letters.


Table 1. Example of the derivational character of the Arabic language


Root           Affix      Affix type   New word             Meaning
Kharaja/خرج    Ist/است    Prefix       Istakhraja/استخرج    To extract
               Aa/ا       Infix        Kharij/خارج          Outside
               Ta/ت       Prefix       Takharaja/تخرج       To graduate
               Ou/وا      Suffix       Kharajou/خرجوا       They went out


3. RELATED WORK 

The use of the Internet, social networks, and the internet of things has become inevitable for
everyone, and the data produced represents an essential component for decision-making and analysis on
general questions and concerns. Without data, no organization or business can function today. Data helps
improve decision-making and gives more insight into strategies and future projects. However, the large
volume and the format of these data, derived from various sources, constitute significant difficulties for
researchers and machine learning practitioners.

Machine learning is not a trivial task. A thorough study is needed to detect the most accurate 
algorithms, hyperparameters, and efficient feature selection methods depending on each domain. That is the 
main reason that AutoML [25], [26] is a convenient alternative to achieve outstanding performance while 
saving time and effort of searching for the appropriate parameters. An interesting analysis is given by [27] 
comparing four AutoML tools with human performance over 13 commonly used datasets, and the obtained 
results were impressive as they show that AutoML tools outperform the machine learning process achieved 
by human data scientists in 4 of 13 tasks. 

Moreover, [28], [29] present two TPOT-based methods for radar signal recognition, aiming to solve
the existing problems of radar feature extraction and low recognition rate. TPOT is used to select and
optimize classifier parameters to improve recognition accuracy. In the experiments of [28], the
overall radar signal recognition rate reaches 94.42%, while Zhang et al. [29] maintained a
TPOT accuracy above 96% under varying signal-to-noise ratios.

Howard et al. [30] benchmarked different text feature representation methods for social media posts
derived from health forums to predict mental health states. They used TPOT and Auto-Sklearn to generate
classifiers with features extracted from textual data. Another study [31], conducted to predict the
clinical diagnosis of depressive individuals, introduced the feature set selector (FSS), which
specifies subsets of the features as separate datasets in order to reduce the computational time of
TPOT. That study indicated that TPOT exceeded the tuned extreme gradient boosting (XGBoost) model,
while implementing FSS improved the results significantly. The work presented in [32] applied TPOT-based
machine learning (ML) to predict angiographic diagnoses of coronary artery disease (CAD). It compared
TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, and
demonstrated through experiments the power of agnostic model selection performed with AutoML TPOT for
predicting CAD diagnosis. Similarly, the study of [33] built a radiomics model with TPOT to predict
molecular parameters essential in diagnosing tumor entities. The most relevant features were extracted from
fluid-attenuated inversion recovery (FLAIR) images and used to generate ten separate TPOT models. The
authors carried out feature selection, model selection, and parameter optimization using TPOT.
According to the model comparison, TPOT optimized the model parameters automatically and found
valuable features that enhanced model performance. It predicted the mutation status of diffuse midline
glioma (DMG), a lethal brain tumor characterized by a histone mutation, in patients with an accuracy of
91.1%.

Other studies of automated machine learning using TPOT concern different fields. The study
conducted in [34] attempted to establish a learning architecture for forecasting and trading stock indices.
Zhou et al. proposed a cascaded model and evaluated its effectiveness by comparing its performance with
other TPOT models. Additionally, Ahlgren et al. [35] proposed a machine learning approach based on TPOT
to predict the dynamic fuel oil consumption. They demonstrated that an optimized model is a reliable tool for
decision support systems.


4. METHOD 

Machine learning has led to exciting results in many NLP tasks, while pointing out the necessity of finding
optimal hyperparameters for the algorithms and the preprocessing phase and of selecting appropriate
features. This requires specific knowledge and experience that depend on the domain of study, data type, and
expected results. Nevertheless, AutoML systems have shown their effectiveness on many problems [36]. As far
as sentiment analysis is concerned, we propose an approach built on TPOT [36], an iterative and
powerful system that uses genetic programming techniques to optimize pipelines and models. The
components of the framework are described in Figure 2.


[Figure labels: informal Arabic dataset; preprocessing; tokenization; TPOT machine learning flow
(preprocessing, selection, selection, classification)]

Figure 2. The system components for sentiment classification


4.1. Tree-based pipeline 

TPOT will search across a wide range of preprocessors, feature constructors, feature selectors, 
models, and parameters to find a set of operators that minimize the error of model predictions. Some of these 
operators are complex and can be time-consuming to perform, especially for large datasets. In this study, we 
consider four operators: 

— Preprocessors: this operator scales the features using the mean and variance of the sample
(StandardScaler), scales the features using the sample median and the interquartile range (RobustScaler),
and generates interaction features through polynomial combinations of numerical features
(PolynomialFeatures). When the number of features is 4 and the degree is 2, the conversion by
PolynomialFeatures can be expressed as:

$\{x'_k\}_{k=1}^{15} = x_i \times x_j$   (1)

where $i \le j$ and $i, j \in \{0, 1, 2, 3, 4\}$, with $x_0 = 1$ (so that $i = j = 0$ gives the
constant term 1).

— Decomposition: randomized principal component analysis [37] is applied for
dimensionality reduction, using an approximate singular value decomposition of the data and keeping
only the most significant singular vectors to project the data onto a lower-dimensional space.

— Feature selection: TPOT implements many feature selection methods, such as SelectKBest,
SelectPercentile, and VarianceThreshold. It can use a linear pipeline that follows a specified structure
starting with feature selection, where we can specify the method and enable the FeatureSetSelector
to reduce TPOT computation time.

— Model selection: TPOT was designed for supervised learning. The models integrate decision tree 
classifier, random forest classifier, gradient boosting classifier, support vector machine, logistic 
regression, and k-nearest neighbors classifier. 
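
As a rough illustration of the degree-2 expansion in (1), the following pure-Python sketch enumerates
the products x_i × x_j for i ≤ j with x_0 = 1 and checks that four input features give 15 outputs.
The function name `polynomial_features_deg2` and the sample values are hypothetical; in practice the
PolynomialFeatures preprocessor performs this conversion, and its column order may differ.

```python
from itertools import combinations_with_replacement


def polynomial_features_deg2(x):
    """Return all products x_i * x_j with i <= j, where x_0 = 1 is prepended.

    For n input features this yields (n + 1)(n + 2) / 2 values: the constant
    term, the original features, and every pairwise product (squares included).
    """
    extended = [1.0] + list(x)  # x_0 = 1 supplies the constant and linear terms
    return [a * b for a, b in combinations_with_replacement(extended, 2)]


sample = [2.0, 3.0, 5.0, 7.0]  # hypothetical feature vector with 4 features
expanded = polynomial_features_deg2(sample)
print(len(expanded))  # number of generated features
```

The count grows quadratically with the number of input features, which is one reason such
preprocessors can become expensive on high-dimensional text representations.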


4.2. Genetic programming 

Genetic programming is a powerful technique that emerged in the 1990s [38]. It is a type
of evolutionary algorithm that addresses automatic programming and machine learning problems. The
genetic programming paradigm is founded on natural selection and biological breeding, as in the
Darwinian evolution of living organisms. In this paper, TPOT pipelines are optimized with genetic
programming, as presented in Figure 3. Each pipeline operator consists of a set of functions, with each
function receiving a set of parameters. Parameter settings are specified in Table 2.

Figure 3. Optimizing machine learning pipelines with genetic programming


Table 2. Genetic programming parameters 


Parameter Value 
Population Size 15 
Generations 5 
Mutation Rate 0.9 
Crossover Rate 0.1 
Selection 0.1 
Scoring Accuracy 


— Population size: number of tree-based pipelines to retain in the genetic programming population at
the start of every generation.

— Generations: number of iterations of the pipeline optimization process. The algorithm repeats
the evaluation-selection-crossover-mutation process for 5 generations.

— Mutation rate: this parameter lies in the range [0.0, 1.0]. It tells the algorithm how many pipelines
to apply random changes to in every generation.

— Crossover rate: this parameter lies in the range [0.0, 1.0]. It tells the algorithm how many pipelines
to recombine through crossover in every generation.

— Selection: this process determines which pipelines are allowed to survive and which pipelines are
allowed to reproduce. Once a set of tree pipelines has been selected for further reproduction, the
following operators are applied: reproduction, mutation, and crossover. 10% of the new population is
created from the best individuals, which constitute the new offspring, and tournament selection is used
to determine which individuals of the population succeed.

— Scoring: a scoring operation, also called a fitness function, is applied to the outcome of the
process. We use accuracy to evaluate the quality of pipelines for classifying sentiments.
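
The evaluation-selection-crossover-mutation cycle can be sketched with a toy evolutionary loop that
uses the values of Table 2. This is not TPOT's implementation: individuals are bit strings standing in
for pipelines, the fitness function stands in for accuracy, and reading the "Selection" value as an
elite fraction is an assumption. All function and constant names are hypothetical.

```python
import random

POPULATION_SIZE = 15   # Table 2
GENERATIONS = 5        # Table 2
MUTATION_RATE = 0.9    # Table 2
CROSSOVER_RATE = 0.1   # Table 2
ELITE_FRACTION = 0.1   # "Selection" in Table 2, assumed here to mean elitism
GENOME_LENGTH = 8      # hypothetical: 8 on/off choices standing in for a pipeline


def fitness(genome):
    """Stand-in for pipeline accuracy: fraction of bits set to 1."""
    return sum(genome) / len(genome)


def tournament(population, rng, k=3):
    """Return the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def mutate(genome, rng):
    """Flip one randomly chosen bit."""
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def crossover(a, b, rng):
    """One-point crossover of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(seed=0):
    """Run the loop and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
                  for _ in range(POPULATION_SIZE)]
    history = [max(map(fitness, population))]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=fitness, reverse=True)
        n_elite = max(1, int(ELITE_FRACTION * POPULATION_SIZE))
        next_gen = ranked[:n_elite]  # best individuals survive unchanged
        while len(next_gen) < POPULATION_SIZE:
            r = rng.random()
            if r < CROSSOVER_RATE:
                child = crossover(tournament(population, rng),
                                  tournament(population, rng), rng)
            elif r < CROSSOVER_RATE + MUTATION_RATE:
                child = mutate(tournament(population, rng), rng)
            else:
                child = list(tournament(population, rng))  # plain reproduction
            next_gen.append(child)
        population = next_gen
        history.append(max(map(fitness, population)))
    return history


best_per_generation = evolve()
print(best_per_generation)
```

Because the elite individuals are copied unchanged, the best fitness never decreases from one
generation to the next. With the two rates summing to 1, every non-elite child comes from either
mutation or crossover.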


5. EVALUATION

There are mainly five vernacular forms of Arabic: Maghrebian Arabic in the North African region,
Egyptian Arabic in the Nile region, Levantine Arabic, Peninsular Arabic in the Gulf region, and
Mesopotamian Arabic in the Iraqi region. This section presents the pre-treatment phase
performed on the datasets, which represent Arabic dialects among the most representative in the Arab
world, namely the Moroccan, Egyptian, and Jordanian dialects. We detail the vectorization method used
and give the results for each configuration.


5.1. Datasets 

The Moroccan sentiment Twitter dataset (MSTD) [39] is a Moroccan dataset retrieved from tweets
and annotated for four-way sentiment classification; we use only its binary (positive/negative) subset.
The second dataset, the Arabic sentiment Twitter dataset (ASTD) [40], handles the Egyptian dialect and is
unbalanced: the positive class counts 799 tweets, whilst the negative class counts 1,684 tweets. The
third dataset, which we refer to as ArTwitter [41], addresses the Jordanian dialect and is balanced, with
1,000 tweets for each class. For the three datasets, we use 70% of the data for training and 30% for
testing. Table 3 describes the MSTD, ASTD, and ArTwitter datasets:


Table 3. Datasets description 


Dataset      Dialect type   Positive tweets   Negative tweets
MSTD         Moroccan       866               2769
ASTD         Egyptian       799               1684
ArTwitter    Jordanian      1000              1000


5.2. Preprocessing phase 

Preprocessing text is essential for all NLP tasks, especially when dealing with informal language.
We first clean the text by removing noisy data such as punctuation, repeated letters, URLs, HTML code,
hashtags, usernames, and non-Arabic letters [42]. Next, we normalize the text by replacing similar
letters with one unique form; for example, "أ", "إ", and "آ" are three forms of the same letter "ا".
We then delete diacritics and stop words. For that purpose, we built a stop-word list from both Moroccan
and Egyptian dialects that matches the most common conjunctions, prepositions, and non-sentimental words,
which are also used in the Jordanian dialect. After tokenizing the documents, we carry out the last phase
of preprocessing by stemming words to their root. The Information Science Research Institute (ISRI)
stemmer [43] is employed to reduce words to their stem.
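
The cleaning and normalization steps can be sketched with regular expressions as follows. This is a
minimal sketch under stated assumptions, not the exact pipeline used in the experiments: the function
name `clean_tweet`, the rule that whole hashtag and username tokens are dropped, and the threshold of
three repetitions for collapsing stretched letters are all hypothetical choices.

```python
import re

# Arabic diacritics (U+064B..U+0652) and the tatweel stretching character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = re.compile(r"\u0640")
URL = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG = re.compile(r"<[^>]+>")
MENTION_OR_HASHTAG = re.compile(r"[@#]\S+")  # assumption: drop the whole token
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")  # keeps Arabic letters and whitespace
REPEATED = re.compile(r"(.)\1{2,}")  # assumption: 3+ repeats count as stretching
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")  # hamza-above, hamza-below, madda


def clean_tweet(text, stop_words=frozenset()):
    """Clean and normalize one tweet, then return its whitespace tokens."""
    text = URL.sub(" ", text)
    text = HTML_TAG.sub(" ", text)
    text = MENTION_OR_HASHTAG.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # map alef variants to bare alef
    text = NON_ARABIC.sub(" ", text)  # drops punctuation, digits, Latin letters
    text = REPEATED.sub(r"\1", text)  # collapse stretched letters to one
    return [tok for tok in text.split() if tok not in stop_words]


print(clean_tweet("@user جمييييل جداً!!! http://t.co/abc"))
```

Stemming is left out to keep the sketch dependency-free; NLTK provides an `ISRIStemmer` class that
could be applied to each returned token.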


5.3. Term frequency-inverse document frequency (TF-IDF) vectorizer 

The TF-IDF weighting approach is commonly used in information retrieval and text mining. This 
statistical metric can be used to assess the significance of a term in a document in relation to a collection or 
corpus. The weight grows in direct proportion to the number of times the term appears in the document. It 
also fluctuates depending on the word's frequency in the corpus. TF-IDF aims to convert a collection of raw 
documents to a matrix of TF-IDF features to express the importance of a word to a document while 
considering the relation to other documents in the corpus. The following formula calculates the TF-IDF 
score: 


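$\mathrm{tfidf}(w, d, D) = \mathrm{tf}(w, d) \times \mathrm{idf}(w, D)$   (2)

with

$\mathrm{tf}(w, d) = \log(1 + f(w, d))$   (3)

and

$\mathrm{idf}(w, D) = \log\left(\frac{N}{f(w, D)}\right)$   (4)

where $f(w, d)$ is the frequency of word $w$ in document $d$, $f(w, D)$ is the number of documents in
which $w$ appears, $N$ is the number of documents, and $D$ is the collection of all documents.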
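
A small worked example may help to read equations (2) to (4). The sketch below computes the weights
in pure Python, assuming that f(w, D) in the idf term denotes the number of documents containing w and
that logarithms are natural. The function name `tfidf` and the toy corpus are hypothetical, and library
vectorizers typically apply additional smoothing and normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights following equations (2)-(4).

    docs is a list of token lists; one {word: weight} dict is returned per document.
    """
    n_docs = len(docs)
    # f(w, D): number of documents that contain w (assumed reading of equation 4)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # f(w, d): raw count of w in this document
        weights.append({
            w: math.log(1 + c) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


# Hypothetical toy corpus of three tokenized documents.
corpus = [["good", "phone", "good"], ["bad", "phone"], ["good", "service"]]
scores = tfidf(corpus)
print(scores[1])
```

In the second document, "bad" occurs in only one document of the corpus and therefore receives a
higher weight than "phone", which occurs in two. A word present in every document would receive a
weight of zero.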


6. RESULTS AND DISCUSSION 

In this section, we present the experiments and the corresponding results. We first study the effect of
changing the hyperparameters of the proposed framework on classification accuracy. All three datasets were
prepared in the same manner, as explained in section 5. We show the effect of the presence or absence of the
stemming module, the TF-IDF with n-grams, and changing the evolutionary process parameters in TPOT. To
make our results comparable with similar work in the literature, we used the accuracy metric, defined by:


$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$   (5)

where TP: true positive, TN: true negative, FP: false positive, and FN: false negative.
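
As a worked instance of (5), the following sketch counts the four outcomes for a binary task and
derives the accuracy. The function name `accuracy` and the label vectors are hypothetical.

```python
def accuracy(y_true, y_pred, positive=1):
    """Compute accuracy from the four confusion-matrix counts of equation (5)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: TP = 2, TN = 1, FP = 1, FN = 1.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
result = accuracy(y_true, y_pred)
print(result)  # 0.6
```

On unbalanced data such as ASTD, accuracy alone can favor the majority class, so it is worth reading
it alongside the class distribution of Table 3.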

For validation, the K-fold cross-validation method with k=5 was adopted. Cross-validation is a
technique used in applied machine learning to estimate a model's skill on unseen data.
That is, it uses a limited sample to assess how the model will perform in general when used to generate
predictions on data that was not used during training. The data is divided into five subsets; each
time, one of the five subsets is used for validation and the four others form the training set, which
helps detect underfitting and overfitting of the proposed model. K-fold cross-validation is a
popular strategy since it is straightforward to grasp and produces a less biased or less optimistic
estimate of model skill than other approaches, such as a simple train/test split.
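
The fold rotation described above can be sketched as an index generator. The function name
`k_fold_indices` is hypothetical, and the sketch splits contiguous blocks without shuffling or
stratification, which a real experiment would normally add.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Hypothetical dataset of 12 samples split into 5 folds.
folds = list(k_fold_indices(12, k=5))
for train, validation in folds:
    print(len(train), validation)
```

Each sample appears in exactly one validation fold, so the five validation scores can be averaged into
a single estimate.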

The result tables report outcomes by dataset and benchmark test. We analyze the effect of combining
n-grams with TF-IDF and how this affects the framework's performance. The corresponding results are shown
in Tables 4, 5, and 6. The best accuracies were obtained with the
combination 1g+2g for the three benchmarked datasets. The morphology of the Arabic language partly
explains this result: the shortest type of sentence having a semantic meaning comprises
two words, as in the negation case, where a negation term precedes a verb to reverse the sentiment, a
construction shared between the various Arabic dialects.
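
To make the 1g+2g configuration concrete, the sketch below builds unigram and bigram features from a
token list. The function name `ngram_features` and the transliterated tokens are hypothetical
placeholders for a negated phrase.

```python
def ngram_features(tokens, max_n=2):
    """Return all n-grams of the token list for n = 1 .. max_n, joined by spaces."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return features


# Hypothetical transliterated tokens: a negation particle followed by a verb and a noun.
features = ngram_features(["ma", "ajabni", "alfilm"], max_n=2)
print(features)
```

The bigram "ma ajabni" keeps the negation particle attached to the verb, which unigrams alone would
treat as two unrelated features.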

Tables 7, 8, and 9 show accuracy measurements with the root-stemming method and without
stemming. The results clearly show the benefits of stemming, since accuracy improves on all three
datasets. For the MSTD dataset, stemming improved the accuracy by 0.024.
These findings arise from reducing words to their root, which minimizes features irrelevant to
sentiment analysis, limits the inflection level that is very high in Arabic, and helps select the
optimum feature set used during the evolution process by TPOT. Moreover, we measured
how far the results are affected by varying the evolutionary algorithm's parameters, namely the mutation
rate, the crossover rate, and the initial population size, as shown in Tables 10, 11, and 12. The results
were only slightly affected by the variation of the mutation rate and the crossover rate, whereas
accuracy increased with the population size. A larger population enables the model to
explore novel pipelines through the selection of new offspring. Due to computational limitations, we
initially performed the process with a population size of 15 and later with 30. The sum of the mutation
rate and the crossover rate must not exceed 1. We ran the other benchmarks with mutation rate=0.9 and
crossover rate=0.1.


Table 4. Effect of TF-IDF with n-grams on ArTwitter

            TF-IDF 1g   TF-IDF 1g+2g   TF-IDF 1g+2g+3g
Accuracy    0.857       0.862          0.857


Table 5. Effect of TF-IDF with n-grams on ASTD

            TF-IDF 1g   TF-IDF 1g+2g   TF-IDF 1g+2g+3g
Accuracy    0.784       0.792          0.784


Table 6. Effect of TF-IDF with n-grams on MSTD

            TF-IDF 1g   TF-IDF 1g+2g   TF-IDF 1g+2g+3g
Accuracy    0.821       0.826          0.816


Table 7. Effect of stemming on ArTwitter

            Stemming=1   Stemming=0
Accuracy    0.862        0.812


Table 8. Effect of stemming on ASTD

            Stemming=1   Stemming=0
Accuracy    0.792        0.758


Table 9. Effect of stemming on MSTD

            Stemming=1   Stemming=0
Accuracy    0.826        0.802


Table 10. Results of changing TPOT parameters on ArTwitter

            Mutation rate=0.5   Crossover=0.4   Population size=30
Accuracy    0.862               0.872           0.863


Table 11. Results of changing TPOT parameters on ASTD

            Mutation rate=0.5   Crossover=0.4   Population size=30
Accuracy    0.784               0.771           0.793


Table 12. Results of changing TPOT parameters on MSTD

            Mutation rate=0.5   Crossover=0.4   Population size=30
Accuracy    0.826               0.8207          0.829


Table 13 compares the accuracy of the TPOT-based approach with that of related approaches from
the literature. The comparison shows that for both the MSTD and ArTwitter datasets, our
proposed approach increases the accuracy considerably and outperforms other approaches based on
convolutional neural networks (CNN) and recurrent neural networks (RNN). On the other hand, our system
gives comparable results for the ASTD dataset, with an accuracy of 79.3%, while the best reported accuracy,
81.6%, was obtained by the combined long short-term memory (LSTM) model with the Adam optimizer.


Table 13. Comparison with other related works 


Dataset      Approach     Technique                                                                          Accuracy
MSTD         [39]         Farasa [44]+SVM                                                                    0.776
             Our system   Root stemmer+TF-IDF+TPOT                                                           0.829
ASTD         [45]         CNN non-static                                                                     0.759
             [46]         Combined-LSTM-Mul, non-static, continuous bag of words (CBOW), Adam optimizer      0.816
             [47]         Lexicon+SVM                                                                        0.751
             Our system   Root stemmer+TF-IDF+TPOT                                                           0.793
ArTwitter    [45]         CNN non-static                                                                     0.85
             [41]         Root stemmer+SVM
             [47]         Lexicon+RNN                                                                        0.85
             Our system   Root stemmer+TF-IDF+TPOT                                                           0.872


7. CONCLUSION 

Machine-learning techniques have been widely exploited for NLP, yet it remains challenging to find the
correct hyperparameters and select the appropriate features. Concerning sentiment analysis, the Arabic
language has gained widespread interest given its prevalence and its difficulty as a morphologically
complex language. Therefore, we have proposed a comprehensive framework for classifying sentiments as
positive or negative in texts written in informal Arabic using different dialectal forms. This framework
comprises a data preparation phase, a document cleaning process, and a stemming module. The
preprocessed data is then fed into a TPOT-based module for pipeline optimization. The results
obtained are promising since we succeeded in improving the accuracy for the three benchmarked datasets.
This work can be expanded to cover larger datasets involving multiple dialects at once. We also intend to
further investigate evolutionary algorithms, considering their significant contribution to the
optimization of the sentiment analysis process.